A percolate table is a special table that stores queries rather than documents. It is used for prospective searches, or "search in reverse."
- To learn more about performing a search query against a percolate table, see the section Percolate query.
- To learn how to prepare a table for searching, see the section Adding rules to a percolate table.
The schema of a percolate table is fixed and contains the following fields:
Field | Description |
---|---|
ID | An unsigned 64-bit integer with auto-increment functionality. It can be omitted when adding a PQ rule, as described in add a PQ rule |
Query | Full-text query of the rule, which can be thought of as the value of a MATCH clause or JSON /search. If per-field operators are used inside the query, the full-text fields need to be declared in the percolate table configuration. If the stored query is intended only for attribute filtering (without full-text querying), the query value can be empty or omitted. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table. |
Filters | Optional. A string containing attribute filters and/or expressions, defined the same way as in a WHERE clause or in JSON filtering. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table. |
Tags | Optional. A comma-separated list of string labels that can be used for filtering or deleting PQ rules. The tags can also be returned along with matching documents when performing a Percolate query |
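For illustration, here is a minimal SQL sketch of a rule using these fields, assuming a percolate table like the products table created below (the filter expression and tags are illustrative):

INSERT INTO products (query, filters, tags) VALUES ('@title shoes', 'meta.price>100', 'footwear,sale');

The id is omitted here, so it is assigned automatically via auto-increment.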
Note that you do not need to add the above fields when creating a percolate table.
When creating a new percolate table, keep in mind that you need to specify the expected document schema, which will be checked against the rules you add later. This is done in the same way as for any other local table.
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
- C#
- Typescript
- Go
- CONFIG
CREATE TABLE products(title text, meta json) type='pq';
POST /cli -d "CREATE TABLE products(title text, meta json) type='pq'"
$index = [
'table' => 'products',
'body' => [
'columns' => [
'title' => ['type' => 'text'],
'meta' => ['type' => 'json']
],
'settings' => [
'type' => 'pq'
]
]
];
$client->indices()->create($index);
utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'')
res = await utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'');
utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");
utilsApi.Sql("CREATE TABLE products(title text, meta json) type='pq'");
res = await utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");
apiClient.UtilsAPI.Sql(context.Background()).Body("CREATE TABLE products(title text, meta json) type='pq'").Execute()
table products {
type = percolate
path = tbl_pq
rt_field = title
rt_attr_json = meta
}
Query OK, 0 rows affected (0.00 sec)
{
"total":0,
"error":"",
"warning":""
}
Array(
[total] => 0
[error] =>
[warning] =>
)
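Once the table is created and rules are added, documents can be percolated against them, e.g. via CALL PQ in SQL. A minimal sketch (see the Percolate query section for the full syntax and options; the document below is illustrative):

CALL PQ('products', '{"title": "new red shoes", "meta": {"price": 150}}', 1 AS docs, 1 AS query);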
A template table is a special type of table in Manticore that doesn't store any data and doesn't create any files on your disk. Despite this, it can have the same NLP settings as a plain or real-time table. Template tables can be used for the following purposes:
- As a template to inherit settings in Plain mode, simplifying your Manticore configuration file.
- Keyword generation with the help of the CALL KEYWORDS command.
- Highlighting an arbitrary string using the CALL SNIPPETS command (see the sketch after the config example below).
- CONFIG
table template {
type = template
morphology = stem_en
wordforms = wordforms.txt
exceptions = exceptions.txt
stopwords = stopwords.txt
}
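For instance, a template table like the one above can be used to test tokenization and highlighting without indexing a single document. A minimal SQL sketch (the input strings are illustrative):

CALL KEYWORDS('running shoes', 'template');
CALL SNIPPETS('I bought a pair of running shoes yesterday', 'template', 'running shoes');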
⪢ NLP and tokenization
Manticore doesn't store text as-is for performing full-text searching on it. Instead, it extracts words and creates several structures that allow fast full-text searching. From the extracted words, a dictionary is built, which allows a quick lookup to discover whether a word is present in the index. In addition, other structures record the documents and fields in which each word was found, as well as its positions within a field. All of these are used when a full-text match is performed.
The process of demarcating and classifying words is called tokenization. Tokenization is applied at both indexing and searching time, and it operates at the character and word levels.
At the character level, the engine allows only certain characters to pass. This is defined by the charset_table. Anything else is replaced with whitespace (which is considered the default word separator). The charset_table also allows mappings, such as lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, or defined as a phrase boundary.
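As a sketch, here is how these character-level settings might look inside a table definition (the values are illustrative, not recommendations):

charset_table = non_cjk, U+00E9->e  # keep the default non-CJK set, map "é" to "e"
ignore_chars = U+AD                 # silently skip soft hyphens
blend_chars = U+26                  # "&" is indexed both as part of a word and as a separator
phrase_boundary = ., ?, !           # these characters insert an extra position gap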
At the word level, the base setting is min_word_len, which defines the minimum word length in characters to be accepted into the index. A common request is to match singular and plural forms of words. For this, morphology processors can be used.
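Inside a table definition, these might be sketched as:

min_word_len = 3        # words shorter than 3 characters are not indexed
morphology = stem_en    # English stemming, so "shoes" can match "shoe"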
Going further, we might want one word to be matched as another because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another.
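A sketch of the word forms setup (the file name and mappings are illustrative):

wordforms = wordforms.txt

where wordforms.txt contains one mapping per line:

walks > walk
core 2 duo > c2d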
Very common words can have unwanted effects on searching, mostly because their sheer frequency makes their doc/hit lists expensive to process. They can be blacklisted with the stop words functionality. This helps not only in speeding up queries but also in decreasing the index size.
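A sketch of enabling stop words, either from a custom file or from one of the built-in per-language lists:

stopwords = stopwords.txt
# or, alternatively, the built-in English list:
stopwords = en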
A more advanced form of blacklisting is bigrams, which allows creating a special token for a pair of a common ("bigram") word and an uncommon word. This can speed up phrase searches involving common words several times over.
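A sketch of the bigram settings, assuming a hand-picked list of frequent words (the list is illustrative):

bigram_freq_words = the, a, of, in
bigram_index = both_freq    # index extra tokens for adjacent pairs where both words are frequent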
When indexing HTML content, it's important not to index the HTML tags themselves, as they can introduce a lot of "noise" into the index. HTML stripping can be used and can be configured to strip tags but still index certain tag attributes, or to completely ignore the content of certain HTML elements.
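A sketch of the HTML stripping settings (the attribute and element lists are illustrative):

html_strip = 1                              # strip HTML markup from incoming text
html_index_attrs = img=alt,title; a=title   # but index the values of these attributes
html_remove_elements = style, script        # and drop the content of these elements entirely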